Software visible and controllable lock-stepping with configurable logical processor granularities

ABSTRACT

A processor is described. The processor includes model specific register space that is visible to software above a BIOS level. The model specific register space is to specify a granularity of a processing entity of a lock-step group. The processor also includes logic circuitry to support dynamic entry/exit of the lock-step group's processing entities to/from lock-step mode including: i) termination of lock-step execution by the processing entities before the program code to be executed in lock-step is fully executed; and, ii) as part of the exit from the lock-step mode, restoration of a state of a shadow processing entity of the processing entities as the state existed before the shadow processing entity entered the lock-step mode and began lock-step execution of the program code.

FIELD OF INVENTION

The field of invention pertains generally to the computing sciences, and, more specifically, to software visible and controllable lock-stepping with configurable logical processor granularities.

BACKGROUND

FIG. 1 shows a traditional multi-core processor 100 that supports “lock-step” execution of pairs of processing cores. As observed in FIG. 1, the prior art processor 100 includes multiple modules 101 of processing cores, where, each module includes four processing cores C0, C1, C2 and C3. The multi-core processor 100 is further arranged as multiple tiles 102 where each tile is composed of multiple modules.

Certain types of software operations are very sensitive to data corruptions (e.g., “bit-flips”). For example, in some cases, if a corruption occurs while data is being encrypted, the original data cannot be decrypted back to its exact original form.

Lock stepping is a hardware assisted approach for ensuring that a software process has been correctly executed without corruption. In the case of lock-stepping, at least two identical processing cores (e.g., C0/C1 or C2/C3) are loaded with a same initial state and begin execution of the same instruction sequence on the same data. Over the course of their execution, ideally, both processing cores will simultaneously generate the same intermediate values, many of which are ultimately written to any of a cache, memory, control register, I/O device register, etc.

As observed in FIG. 2, the module that the cores belong to includes special hardware 201 to broadcast the input to one core C_(X) to the other core C_(Y) so they can operate the same code from the same state. The module also includes a pair of comparators 202 to compare the intermediate values. That is, an intermediate value produced by one of the cores C_(X) is twice compared with the corresponding intermediate value that was simultaneously produced by the other of the cores C_(Y) (the pair of comparators provides redundancy of the comparison operation). If the comparators yield different comparison results, one of the comparators is not working correctly, and/or, the intermediate values were different.

The cores C_(X), C_(Y) continue to execute the instruction sequence with the comparators 202 comparing the intermediate values that are generated along the way. After execution of the instruction sequence is complete, the pair of executions either deviated from one another or they did not deviate from one another. In the case of the former (the pair of executions deviated), either the comparators 202 yielded different comparison results for at least one intermediate value, and/or, the final resultants generated by the cores C_(X), C_(Y) at completion are different. In the case of the latter (the pair of executions did not deviate), the comparators 202 never yielded different comparison results and the final resultants generated by the cores C_(X), C_(Y) at completion are the same.
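By way of illustration only, the redundant comparison of intermediate values can be modeled in software as follows. This is a minimal sketch and not the comparator circuitry itself: the names `equal` and `first_deviation`, the representation of intermediate values as integers, and the treatment of unequal stream lengths as a deviation are assumptions made for the example.

```python
from typing import Callable, Optional, Sequence, Tuple

Comparator = Callable[[int, int], bool]


def equal(a: int, b: int) -> bool:
    """Model of one hardware comparator: True when both values match."""
    return a == b


def first_deviation(
    active: Sequence[int],
    shadow: Sequence[int],
    comparators: Tuple[Comparator, Comparator] = (equal, equal),
) -> Optional[int]:
    """Return the step index of the first detected deviation, or None."""
    for step, (x, y) in enumerate(zip(active, shadow)):
        # Each pair of intermediate values is compared twice.
        first, second = (compare(x, y) for compare in comparators)
        # A deviation is flagged if the comparators disagree with each
        # other (one may be faulty) or if both report a mismatch.
        if first != second or not first:
            return step
    # Assumption: streams of different length also count as a deviation.
    if len(active) != len(shadow):
        return min(len(active), len(shadow))
    return None
```

In this model, two identical value streams yield no deviation, whereas a single differing value, or a comparator that disagrees with its partner, is reported at the step where it occurs.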

Referring back to FIG. 1, each module 101 is designed so that only C0 and C1 can be a lock-step pair and only C2 and C3 can be a lock-step pair. The C0/C1 pair therefore have associated model specific register (MSR) space (not shown in FIG. 1 for illustrative ease) that specifies whether C0/C1 have been lock step mode enabled or not (“machine specific register” can also be used to refer to the acronym “MSR”). Likewise, the C2/C3 pair also have associated MSR space that specifies whether C2/C3 have been lock step mode enabled or not. Thus, the lock-step configuration of a module 101 can be one of four possible states: 1) no cores are lock-step enabled; 2) only C0/C1 are lock-step enabled; 3) only C2/C3 are lock-step enabled; 4) C0/C1 are lock step enabled and C2/C3 are lock-step enabled.
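The four module states follow from two independent enable settings, one per fixed pair. A short sketch, with the hypothetical names `PAIRS` and `module_states`, enumerates them:

```python
from itertools import product

# The two fixed lock-step pairs of a prior art module.
PAIRS = ("C0/C1", "C2/C3")


def module_states():
    """Enumerate every combination of lock-step enabled pairs."""
    states = []
    for enabled in product((False, True), repeat=len(PAIRS)):
        # Keep only the pairs whose enable setting is on.
        states.append(tuple(p for p, on in zip(PAIRS, enabled) if on))
    return states
```

Two binary settings give 2 x 2 = 4 states, matching the list above.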

According to the design of the processor 100 of FIG. 1, if any pair of cores are to be placed in lock step, lowest-level firmware/software (Basic Input/Output System (BIOS)) manipulates the aforementioned MSR space of the module 101 to enable lock-step mode for the desired pair(s) of cores. In response to the write to the MSR space, the cores that are newly lock-step enabled have their state saved and are placed into a sleep state. After being put to sleep, both cores are configured to execute the corruption sensitive instruction sequence (each core is set up with identical instructions and data). The cores are then woken up and they begin lock-step execution.

A problem with the processor of FIG. 1 is that lock-stepping activity is controlled by the BIOS which gives higher levels of software (e.g., virtual machine monitors (VMMs), operating system (OS) instances, applications, etc.) little/no visibility into the lock-step activity. Here, during lock-step, one of the cores is deemed the active core while the other of the cores is deemed a “shadow” core. The active core, for instance, is the core that is executing the thread that has the corruption sensitive instruction sequence. The shadow core, by contrast, is a core that needs to be specially re-purposed to essentially double check execution of the active core's thread.

As such, when lock-step mode is enabled, the threads that were executing on the shadow core suddenly have their core “disappear” (the shadow core is permanently placed in lock-step so that it cannot be used, other than for lock-stepping). Such drastic changes in the apparent configuration of the underlying hardware can, in at least some cases, detrimentally affect the software (e.g., the pool of cores to which threads can be dispatched suddenly loses a core). Moreover, with lock-stepping being controlled at the BIOS level, entering lock-step was more akin to a time consuming hardware reset.

Further still, again because lock-stepping was controlled at the BIOS level, once a core was placed in lock-step mode, the mode could not be exited. Here, BIOS is a piece of software/firmware that runs at the start time of a computer that is being powered on and does not run afterwards. As such, in BIOS initiated lockstep, cores are placed into lockstep early on at the time of BIOS execution and remain in lock-step thereafter (after being placed into lock-step BIOS ceases execution and is not available thereafter to remove the cores from lock-step).

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:

FIG. 1 shows a multiprocessor that can perform lock-step execution (prior art);

FIG. 2 shows processing cores and supporting hardware for lock-step execution;

FIG. 3a shows an improved multiprocessor that can perform lock-step execution;

FIG. 3b shows a method that can be performed by the improved multiprocessor of FIG. 3a;

FIG. 4 shows possible lock-step group configurations;

FIGS. 5a, 5b, 5c, 5d and 5e show lock-step groups having different logical processor granularities;

FIGS. 6a, 6b, 6c and 6d show model specific register space that is visible to and write-able by software;

FIGS. 7a, 7b, 7c, 7d, 7e, 7f, 7g, 7h and 7i depict a process of entering lock-step mode;

FIGS. 8a, 8b, 8c, 8d and 8e depict a process of exiting lock-step mode;

FIG. 9 is a block diagram illustrating processing components for executing instructions, according to some embodiments;

FIG. 10A is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline according to some embodiments;

FIG. 10B is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming, out-of-order issue/execution architecture core to be included in a processor according to some embodiments;

FIGS. 11A-B illustrate a block diagram of a more specific exemplary in-order core architecture, which core would be one of several logic blocks (including other cores of the same type and/or different types) in a chip;

FIG. 11A is a block diagram of a single processor core, along with its connection to the on-die interconnect network and with its local subset of the Level 2 (L2) cache, according to some embodiments;

FIG. 11B is an expanded view of part of the processor core in FIG. 11A according to some embodiments;

FIG. 12 is a block diagram of a processor that may have more than one core, may have an integrated memory controller, and may have integrated graphics according to some embodiments;

FIGS. 13-16 are block diagrams of exemplary computer architectures;

FIG. 14 is a block diagram of a first more specific exemplary system in accordance with some embodiments;

FIG. 15 is a block diagram of a second more specific exemplary system in accordance with some embodiments;

FIG. 16 is a block diagram of a System-on-a-Chip (SoC) in accordance with some embodiments;

FIG. 17 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set according to some embodiments.

DETAILED DESCRIPTION OF THE EMBODIMENTS

An improved approach therefore strives to make the lock-step activity more visible to software that resides above the BIOS level and executes beyond a computer's early boot-up time. Here, software is commonly viewed as a “stack” of different functional levels. At a highest level is application software that instructs a computer to perform particular/customized tasks that an end user desires the computer to perform. Beneath the application software is an operating system that the application software invokes to use the computer's hardware resources (e.g., non-volatile mass storage, main memory, CPU resources, network interface(s), etc.). In, e.g., higher performance environments (e.g., datacenters), multiple operating system instances are configured to operate on a VMM or hypervisor. Beneath the VMM/hypervisor in the stack hierarchy is BIOS and firmware. BIOS, as explained above, is firmware/software that executes early on during a computer's boot-up/bring-up. Firmware is program code that controls a specific hardware component.

Moreover, a computer platform typically has different “privilege levels”. A software program, task or thread that is written to access or control a sensitive hardware component (e.g., sensitive register space, sensitive regions of main memory, etc.) is supposed to be assigned a higher privilege level, while, programs/tasks/threads that do not seek access or control of a sensitive hardware component are assigned a lower privilege level. BIOS, being software/firmware that executes closely with the hardware, is typically allocated higher/highest privilege levels. Also, certain threads/tasks of a VMM or operating system can be assigned higher/highest privilege levels. Common application software, however, is typically assigned a lower/lowest privilege level.

With BIOS being in control of lock-step activity and being assigned higher/highest privilege levels, in the prior art approach, lock-step activity was not visible to software having lesser privilege levels (e.g., application software and/or certain VMM/OS tasks/threads). Here, with the improved approach, as described in more detail below, model specific registers (MSRs) are used to control and/or provide visibility into the lock-step activity. As such, by associating the MSRs with a particular privilege level (including, if desired, lesser privilege levels) any software above/beyond BIOS can be configured to control/observe lock-step activity.

As such, to make the lock-step activity more visible to software that resides above the BIOS level (e.g., if desired, software programs/tasks/threads having less than a highest privilege level or lesser privilege level), in the improved approach, e.g., any of VMMs, hypervisors, OS instances, and/or application software, etc. are able to monitor and/or control lock-step execution (nevertheless, in various embodiments BIOS is also given the ability to observe and control lock-step activity). With software layers above BIOS having the ability to monitor and control lockstep, lockstep mode can be dynamically and repeatedly entered/exited during runtime at will (a processing entity is no longer permanently in lock-step after being placed in lock-step).

FIG. 3a shows an improved multi-core processor 300 whose cores can be placed in lock-step mode under higher level software control (e.g., VMM, OS instance, application) and/or lesser privilege level software control (e.g., application software, certain VMM/OS tasks/threads). Software is therefore less prone to being adversely affected by lock-step entry or exit, lock-step execution can be stopped before it has completed, and, the processing entities (e.g., cores) are not permanently placed in lock-step mode. Additionally, the delays associated with entry into and exit from lock-step mode are more consistent with power state changes having smaller delays as opposed to hardware resets having longer delays.

In particular, according to various embodiments, lock-step entry/exit is akin to C6 power state entry/exit in which a core enters/exits a deep sleep mode. C6 power state entry/exit is generally associated with a power state where a core's state is externally saved to put it to sleep and its clocks are removed to save power. Power management software of computing systems frequently invokes C6 entry/removal on a per core basis. Thus, implementing the state saving/re-loading activity of both the active and shadow cores for entry/exit to/from lock-step mode as akin to C6 entry/removal comfortably integrates lock-step hardware support into existing software platforms from both the perspective of functional cohesion (shadow cores do not disappear) and propagation delay (less time is consumed entering/exiting to/from lock-step mode).

Furthermore, as explained in more detail immediately below, the range of instruction execution resources that can be placed into lock-step mode in the improved multiprocessor of FIG. 3a is greatly expanded as compared to the multi-processor of FIG. 1. In particular, as explained in more detail below, individual instruction execution pipelines, cores, modules and tiles can all be placed into lock-step mode with a symmetrical peer. Further still, the improved processor 300 of FIG. 3a is able to break out of lock-step mode once lock-step mode has been entered.

Importantly, the improved processor of FIG. 3a includes enhanced MSR space 311, 312, 313, 314 that is visible to software and provides the software with sufficient information to control the lock step activity. For ease of illustration, FIG. 3a shows register space 311, 312, 313, 314 organized into four different MSR registers 311, 312, 313, 314. Other embodiments may exist where all of register space 311, 312, 313, 314 is organized, e.g., into a single register, fewer than four registers, etc. (here, register “space” corresponds to a field within a register). For illustrative ease FIG. 3a also shows the register space 311, 312, 313 and 314 only for a single instruction execution pipeline. As is known in the art, an instruction execution pipeline is a fundamental hardware unit for the execution of an instruction sequence/thread.

In various embodiments, MSR registers are assigned different classes, where, a different class defines a different combination of software levels that are permitted to access an MSR having that class (e.g., class 1=BIOS, VMM and OS have permission to access; class 2=only BIOS has permission to access; class 3=only VMM and OS have permission to access; class 4=only system management mode (SMM) has permission to access). According to one implementation, the MSR registers 311, 312, 313 and 314 are class 1 MSR registers and are therefore made accessible to any of BIOS, VMM and OS software. In various other embodiments, other classes and/or access privileges can be assigned for the MSR registers 311, 312, 313, 314 (e.g., class 2, class 3, a class that only allows VMM access, a class that only allows an OS permission). In yet other embodiments, application software and/or certain tasks/threads of an application software program can be given a privilege level to access a class 1 MSR (e.g., if desired by a user and allowed by the underlying OS), or other high/higher privilege level.
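The example class assignments above amount to a lookup from MSR class to the set of software levels permitted access. A minimal sketch follows; the table name `MSR_CLASS_ACCESS`, the function `may_access` and the string labels for the software levels are hypothetical and are used only to illustrate the four example classes.

```python
# Example classes from the description: class number -> software levels
# that are permitted to access an MSR of that class.
MSR_CLASS_ACCESS = {
    1: {"BIOS", "VMM", "OS"},
    2: {"BIOS"},
    3: {"VMM", "OS"},
    4: {"SMM"},
}


def may_access(msr_class: int, software_level: str) -> bool:
    """Return True if the software level may access an MSR of this class."""
    return software_level in MSR_CLASS_ACCESS[msr_class]
```

Under this model, MSR registers 311, 312, 313 and 314 implemented as class 1 registers would be accessible to an OS or VMM but not, for instance, to software running only in SMM.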

In various embodiments, such register space 311, 312, 313, 314 exists for each instruction execution pipeline that can participate in lock-step execution. Importantly, each core includes multiple instruction execution pipelines (e.g., 8, 16, etc.). As such, according to various embodiments, one or more instruction execution pipelines per core are recognized as “processing entities” that are capable of lock-step execution with a corresponding one or more instruction execution pipeline peers.

According to one approach, the lock-step partner/peer of an instruction execution pipeline within a particular core can be configured to be the same, corresponding instruction execution pipeline within another (e.g., neighboring) core. For example, as observed in FIG. 4, there are N instruction execution pipelines per core. Here, the lock-step peer of pipeline T0 in core C0 can be pipeline T0 in core C1 (e.g., T0_C0=active, T0_C1=shadow), the lock-step peer of pipeline T1 in core C0 can be pipeline T1 in core C1 (e.g., T1_C0=active, T1_C1=shadow), etc. Moreover, larger symmetrical lock-step groups having multiple active and shadow processing entities can be configured (e.g., T0/T1_C0=active, T0/T1_C1=shadow).
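The peer relationship described above can be sketched as a mapping from each active pipeline to the same-numbered pipeline in the neighboring core. The function name `build_group` and the assumption that the shadow core is the active core's index plus one are illustrative only.

```python
from typing import Dict, Iterable


def build_group(pipelines: Iterable[int], active_core: int) -> Dict[str, str]:
    """Map each active pipeline to its same-numbered shadow pipeline.

    Assumption: the shadow core is the next core (active_core + 1).
    """
    shadow_core = active_core + 1
    return {
        f"T{t}_C{active_core}": f"T{t}_C{shadow_core}"
        for t in pipelines
    }
```

For example, `build_group([0, 1], 0)` describes the larger symmetrical group in which T0 and T1 of C0 are active and T0 and T1 of C1 are their shadows.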

FIGS. 5a through 5e elaborate on further lock step groups that can be defined with the improved processor of FIG. 3a.

FIG. 5a shows three different lock-step group configurations 511, 512, 513 across three different modules M0-M2 that can be arranged with instruction execution pipelines as the processing entities. In a first configuration 511 only one instruction execution pipeline per core is a processing entity for lock-step purposes. In this case, the pipeline from one core is the active processing entity and the pipeline from the other core is the shadow processing entity. Configuration of a lock-step group consisting of only two pipelines may be suitable, for instance, if the corruption sensitive routine is a relatively simple single thread software process.

A second configuration 512 duplicates the lock-step activity of configuration 511 by including two different lock-step groups that both consume one pipeline as the active processing entity and another pipeline as the shadow processing entity. Here the active processing entity is a pipeline from C0 and C2 for the two different groups, respectively, and, the shadow processing entity is a pipeline from C1 and C3 for the two different groups, respectively. This particular configuration 512 may be useful, for instance, if there are two isolated and concurrent corruption sensitive processes that each consume a single thread.

A third configuration 513 includes four pipelines in a single lock-step group. Here, two pipelines in C0 are the active processing entities while two, corresponding pipelines in C1 are the shadow processing entities. Configuration of a lock-step group consisting of four pipelines as depicted in configuration 513 may be suitable, for instance, if the corruption sensitive routine is a more complex routine that, e.g., concurrently consumes two hardware threads.

Here, any of configurations 511, 512, 513 may be defined by programming the appropriate definition in the MSR register space 311, 312, 313 of each of the affected pipelines. Other combinations of lock-step groups defined at the pipeline level can be configurably defined, e.g., through MSR register space as described in more detail below.

The number of pipelines per core that can be configured as part of a lock-step group can also vary from implementation to implementation. For example, a first multi-core processor may be designed so that each/all of the pipelines in a core can be individually assigned as a processing entity within a lockstep group. By contrast, the cores of another multi-core processor (or other cores of the first multi-core processor) may be designed so that less than all of the pipelines in a core can be uniquely assigned as a processing entity within a lockstep group.

The configurations 511, 512, 513 of FIG. 5a indicate that active and shadow pipelines of a same lock-step group are never in a same core. Although this approach might eliminate corruptions associated with manufacturing related defects (in which pipelines in a same core are apt to exhibit same corruptions), nevertheless, in alternate architectures, pipelines within a same core can be named active and shadow pipelines of a same lock-step group.

Whereas FIG. 5a shows different lock-step groups having processing entities at pipeline granularity, by contrast, FIG. 5b shows three lock-step group configurations 521, 522, 523 having processing entities at core granularity (“core granularity”). Configuration 521 shows a first configuration where C0 and C1 form a lock-step group, configuration 522 shows a second configuration where C2 and C3 form a lock-step group, and, configuration 523 shows a third configuration where C0 and C1 form a first lock-step group and C2 and C3 form a second lock-step group. In configuration 523, the C0/C1 lock-step group executes a first, corruption sensitive process while the C2/C3 lock-step group concurrently executes a second, different corruption sensitive process. In an embodiment, when a lock-step group is defined at core granularity, all pipelines within a core are available to execute instructions during lock-step execution. Such configurations can be appropriate under a number of circumstances, e.g., when the corruption sensitive process to be verified consumes an entire core.

FIG. 5c shows a lock step group configuration 531 having processing entities at module granularity (“module granularity”). Here, a lock-step group is formed in which C0/C1/C2/C3 of a first module M0 correspond to the active processing entity and C0/C1/C2/C3 of a second module M1 correspond to the shadow processing entity. Even larger module granularity lock-step groups can be configured that consist of an additional number of active modules and an equal number of additional shadow modules. In an embodiment, when a lock-step group is defined at module granularity, all cores in a module, and all pipelines within a core are available to execute instructions during lock-step execution.

FIG. 5d shows another lock step group configuration 541 having processing entities at tile granularity (“tile granularity”). In particular, FIG. 5d shows a first tile T0 acting as the active processing entity and a second tile T1 acting as the shadow processing entity. Here, a tile consists of multiple modules. Larger tile granularity lock-step groups can consist of an additional number of active tiles and an equal number of additional shadow tiles. In an embodiment, when a lock-step group is defined at tile granularity, all modules in a tile, all cores in a module, and all pipelines within a core are available to execute instructions during lock-step execution.

FIG. 5e shows another lock step group configuration 551 having processing entities at die granularity (“die granularity”). In particular, FIG. 5e shows a first die D0 acting as the active processing entity and a second die D1 acting as the shadow processing entity. Here, a die corresponds to an entire semiconductor chip and consists of multiple tiles. Larger die granularity lock-step groups can consist of an additional number of active dies and an equal number of additional shadow dies. In an embodiment, when a lock-step group is defined at die granularity, all tiles within a die, all modules in a tile, all cores in a module, and all pipelines within a core are available to execute instructions during lock-step execution.

In order to configure/define lock-step groups in any of the granularities discussed above, in various embodiments, referring back to FIG. 3a, lock step group definition MSR space 311, 312, 313 exists for each processing entity in the lock-step group. More formally, a first MSR register 311 is referred to as the lock-step group definition (LSGD) MSR. A second MSR register 312 is referred to as the lock-step mode enable (LSME) MSR. A third MSR register 313 is referred to as the lock step group state (LSGS) MSR. A fourth MSR register 314 is referred to as the lock step break status (LSBS) MSR. Importantly, in various embodiments, each of the MSRs is visible to software and, for certain register space, can be written to by software so that software can understand and control the processor's lock step activity.

FIG. 6a shows an embodiment of the LSGD MSR 611. According to an embodiment, there is one instance of the LSGD MSR for each processing entity in a lock step group. As observed in FIG. 6a, a first field 601 specifies the granularity level of the processing entity's lock-step group (pipeline, core, module, tile or die). A second field 602 specifies whether the processing entity is allowed to participate as an active processing entity within the lock-step group. A third field 603 specifies whether the processing entity is allowed to participate as a shadow processing entity within the lock step group. A fourth field 604 identifies the peer/partner of the processing entity within the lock step group.
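As one possible illustration of the four LSGD fields, the sketch below packs them into a single integer value and unpacks them again. The bit positions and widths (granularity in bits 2:0, active-capable in bit 3, shadow-capable in bit 4, peer identifier in bits 15:8) are hypothetical assumptions chosen for the example; the description does not specify a layout, and the names `Lsgd`, `encode` and `decode` are likewise illustrative.

```python
from dataclasses import dataclass

# Granularity levels named for field 601, in ascending size.
GRANULARITIES = ("pipeline", "core", "module", "tile", "die")


@dataclass(frozen=True)
class Lsgd:
    granularity: str      # field 601
    active_capable: bool  # field 602
    shadow_capable: bool  # field 603
    peer_id: int          # field 604


def encode(lsgd: Lsgd) -> int:
    """Pack the fields using the assumed (hypothetical) bit layout."""
    if not 0 <= lsgd.peer_id <= 0xFF:
        raise ValueError("peer_id must fit in 8 bits")
    return (
        GRANULARITIES.index(lsgd.granularity)
        | (int(lsgd.active_capable) << 3)
        | (int(lsgd.shadow_capable) << 4)
        | (lsgd.peer_id << 8)
    )


def decode(raw: int) -> Lsgd:
    """Unpack a raw value produced by encode()."""
    return Lsgd(
        granularity=GRANULARITIES[raw & 0x7],
        active_capable=bool((raw >> 3) & 1),
        shadow_capable=bool((raw >> 4) & 1),
        peer_id=(raw >> 8) & 0xFF,
    )
```

With this layout, a core that may only be active and whose peer has identifier 1 would be described by `Lsgd("core", True, False, 1)`.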

In an embodiment, the LSGD MSR 611 is a read-only register that specifies what lock-stepping capability the underlying hardware is designed to support for the processing entity (e.g., the LSGD is an enumeration MSR that specifies machine capability). Here, the processor hardware needs to have the appropriate circuitry between processing entities of a same lock-step group in order for that lock-step group to physically exist (comparators, state replication and broadcast circuitry, etc.). Thus, the LSGD MSR 611 essentially describes the underlying hardware processor design.

Here, in an embodiment, the processor hardware is designed such that separate instances of the LSGD MSR space 611 exist for each instruction execution pipeline in a core that can operate as a processing entity in a lock-step group having pipeline granularity. The LSGD MSR space of one of these pipelines also serves as the LSGD MSR space for the pipeline's core if the core is to be a processing entity within a lock-step group having processing entities at core granularity. One of the “core” LSGD MSR instances amongst the cores in a same module also serves as the “module” LSGD MSR space for the module if the module is to be a processing entity within a lock-step group whose processing entities have module granularity.

The hierarchy then continues with one module level LSGD MSR amongst multiple module level LSGD MSRs within a same tile being used as the LSGD MSR space for the module's tile when the tile is to be a processing entity within a lock-step group having tile granularity, and, one tile level LSGD MSR amongst multiple tile LSGD MSRs within a same die being used as the LSGD MSR space for the die when the die is a processing entity within a lock-step group having die granularity.

As an example, FIG. 5b shows the corresponding LSGD MSR register space 561 for configuration 521 of FIG. 5b. With respect to configuration 521 of FIG. 5b, there is separate LSGD MSR space for both core C0 and core C1. Here, the level field 601 and peer field 604 of these MSRs indicate that a core granularity lock step group is formed from cores C0 and C1. The active field 602 and shadow field 603 of these MSRs indicate that core C0 is the active processing entity and core C1 is the shadow processing entity.

As another example, FIG. 5c shows the corresponding LSGD MSR register space 571 for configuration 531 of FIG. 5c. With respect to configuration 531 of FIG. 5c, there is separate LSGD MSR space for both module M0 and module M1. Here, the level field 601 and peer field 604 of these MSRs indicate that a module granularity lock step group is formed from modules M0 and M1. The active field 602 and shadow field 603 of these MSRs indicate that module M0 is the active processing entity and module M1 is the shadow processing entity.

The role/use of the LSME and LSGS MSRs 312, 313 in conjunction with the use of the LSGD MSR 311 is best explained through an example. As such, FIGS. 7a through 7i depict a method by which a lock-step group can enter lock-step mode. For ease of discussion, the method of FIGS. 7a through 7i assumes a lock-step group of core granularity with cores C0 and C2 being the active processing entities and cores C1 and C3 being the shadow processing entities (C1 is the shadow processing entity for C0 and C3 is the shadow processing entity for C2). However, consistent with the discussion above of FIGS. 4 and 5a-5e, the reader should understand that lock-step groups can be formed having different numbers/combinations of active and shadow processing entities. Moreover, lock-step groups can be formed whose processing entities are defined at a granularity other than core granularity.

As observed in FIG. 7a, the method begins with the state of the LSGD MSR space 721 being defined in the processor. As observed in FIG. 7a, the respective LSGD MSRs 721 for cores C0 and C2 indicate that these cores can only be active cores. By contrast, the respective LSGD MSRs 721 for cores C1 and C3 indicate that these cores can only be shadow cores. Moreover, the respective LSGD MSRs 721 for C0 and C1 indicate that C0 and C1 are lock-step peers, and, those for C2 and C3 indicate that C2 and C3 are lock-step peers.

Then, as observed in FIG. 7b, software reads the LSGD MSR space 721 to understand the lock-step capabilities of the cores. The read of the LSGD MSR space 721 by software can be triggered by program code executing on C0 and/or C2 realizing that it is about to execute corruption sensitive code. After this trigger event, the LSGD space for all cores (C0, C1, C2, C3) is read to understand their respective lock-step capabilities.
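A sketch of how software might interpret the LSGD values it has read is given below. The dictionary representation of each core's LSGD fields and the function name `lockstep_pairs` are assumptions for illustration; an actual implementation would obtain the values through whatever MSR read mechanism the platform provides.

```python
from typing import Dict, List, Tuple

# Hypothetical decoded LSGD contents for the example of FIG. 7a.
LSGD_SPACE = {
    "C0": {"level": "core", "active": True, "shadow": False, "peer": "C1"},
    "C1": {"level": "core", "active": False, "shadow": True, "peer": "C0"},
    "C2": {"level": "core", "active": True, "shadow": False, "peer": "C3"},
    "C3": {"level": "core", "active": False, "shadow": True, "peer": "C2"},
}


def lockstep_pairs(lsgd: Dict[str, dict]) -> List[Tuple[str, str]]:
    """Derive (active, shadow) pairs and check that they are consistent."""
    pairs = []
    for name, cap in sorted(lsgd.items()):
        if not cap["active"]:
            continue
        peer = lsgd.get(cap["peer"])
        # The peer must exist, be shadow capable, point back at this
        # entity, and be defined at the same granularity level.
        if peer is None or not peer["shadow"]:
            raise ValueError(f"{name}: peer cannot act as shadow")
        if peer["peer"] != name or peer["level"] != cap["level"]:
            raise ValueError(f"{name}: peer definition is inconsistent")
        pairs.append((name, cap["peer"]))
    return pairs
```

For the values above, the derived pairs are C0 with C1 and C2 with C3, matching the lock-step group assumed in the discussion of FIGS. 7a through 7i.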

Referring to FIG. 7c, after the software understands the lock-step capabilities of the cores, the software begins to configure the desired lock-step group in the LSME MSR space 722 consistent with their capabilities. In particular, software executing on both of the shadow processing entities C1, C3 proceeds to request entrance of lock step mode by writing to their respective lock-step mode enable (LSME) MSR register space 722.

An embodiment of the LSME MSR is depicted in FIG. 6b. As observed in FIG. 6b, the LSME MSR is a two bit MSR that reserves a first bit 611 to indicate whether or not the processing entity is in lock step mode, and, reserves a second bit 612 to indicate whether the processing entity is to be an active processing entity or is to be a shadow processing entity.

Here, referring back to FIG. 7c, LSME MSR register space 722 exists for each processing entity, and, as explained in more detail immediately below, each processing entity's state, in terms of being in lock step mode or not being in lock step mode, is defined in part by the state of its LSME MSR register space 722.
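The two bit LSME MSR can be illustrated with a pair of bit masks. The assignment of the lock-step mode bit to bit 0, of the role bit to bit 1, and the convention that a set role bit means shadow are hypothetical choices for this sketch, as are the names `lsme_value` and `lsme_decode`.

```python
LSME_ENABLE = 0b01  # assumed position of the lock-step mode bit
LSME_SHADOW = 0b10  # assumed position of the role bit (set = shadow)


def lsme_value(enabled: bool, shadow: bool) -> int:
    """Build the two bit value software would write to the LSME MSR."""
    return (LSME_ENABLE if enabled else 0) | (LSME_SHADOW if shadow else 0)


def lsme_decode(value: int) -> dict:
    """Interpret a two bit LSME value."""
    return {
        "enabled": bool(value & LSME_ENABLE),
        "role": "shadow" if value & LSME_SHADOW else "active",
    }
```

Under these assumptions, a shadow processing entity requesting lock-step mode would write `lsme_value(True, True)`, and an active processing entity would write `lsme_value(True, False)`.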

As observed in FIG. 7c, the processing entities that are to be shadow processing entities in the lock-step group (i.e., cores C1 and C3) request lock-step activation in their respective LSME MSR register space 722 before the active processing entities C0, C2. Here, in various embodiments, the active processing entities C0, C2 are already scheduled to execute the corruption sensitive instruction sequence as part of their normal/nominal software execution process. That is, for example, during nominal execution the active processing entities C0, C2 are executing instructions for, e.g., a first software application, and, the shadow processing entities C1, C3 are executing instructions for, e.g., a second different software application that, e.g., has little/no relationship with the first software application. Alternatively, either or both of C1 and C3 can be idle. For ease of explanation the remainder of the discussion assumes C1 and C3 are actively executing instructions just before the lock-step group is formed.

The first software application is written or is otherwise configured to recognize when it is about to execute a corruption sensitive instruction sequence. As such, the first software application essentially requests the formation and activation of the lock-step group which results in the lock-step group definition being written into the LSME MSR space 722. After the lock-step group is formed and lock-step mode begins (as explained further below), the first software program then goes forward with executing the corruption sensitive sequence on the active processing entities C0, C2 as per nominal configuration (the active processors are the processors assigned to execute the first software program).

By contrast, the second software program, having potentially no relationship to the first software program, essentially has to be temporarily parked so that its processing entities (the shadow processing entities C1, C3) can be dedicated to double-checking the active processing entities C0, C2 in lock-step mode. Alternatively, the second software program may be rescheduled and placed onto different cores if such cores are available. For ease of discussion, the remaining discussion will assume the second software program is parked.

Thus, whereas the first software program can be written or otherwise configured to plan on lock-step mode when the corruption sensitive instructions are about to be executed, the second software program receives an unexpected interrupt and needs to temporarily park its execution.

Here, because the significance of the interrupt to the second software program is unpredictable, the sequence for activating lock-step mode initially places the shadow processors C1, C3 in lock-step mode before the active processing entities C0, C2 to ensure that the shadow processing entities C1, C3 are, in fact, available for lock-step mode and can be properly configured to enter lock-step mode.

As such, as observed in FIG. 7d, after the shadow logic processors C1, C3 request activation in their respective LSME MSR register space 722, in response to the write to the LSME MSR, processor hardware begins to externally save the state of the shadow processing entities and place them into a special “wait for lock step” sleep state. In an embodiment, the saving of the state and the entry into the special sleep state is akin to a C6 entry in which, e.g., a check point is marked in the state and the state is saved in on-die SRAM. Alternatively or in combination the check pointed state may be saved elsewhere (e.g., to cache, memory or non volatile storage).

Meanwhile, the active processing entities C0, C2 observe the state of their respective lock-step group status (LSGS) MSR register space 723 to understand when the shadow processing entities C1, C3 are in the special sleep state and ready for lock-step mode. FIG. 6c shows an embodiment of the LSGS MSR register space. Here, as explained in more detail below, a first field 613 indicates whether all of the shadow processors in the lock step group have requested lock step mode entry and have successfully saved their state and entered the special sleep state. As observed in FIG. 7d, the first field in the LSGS MSR space 723 is a 0 which means not all of the shadow processors in the lock step group have saved their state and are in the special sleep state ready to enter lock-step.

As observed in FIG. 7e, one shadow processing entity (C3) has completely saved its state and is in the sleep state ready to enter lock-step. However, the other shadow processing entity (C1) has not yet completely saved its state and entered the sleep state. As such, the first field of the LSGS MSR for both active processing entities continues to indicate that the shadow processing entities are not yet ready for lock-step mode entry (first field of LSGS MSR=0).

As observed in FIG. 7f, the remaining shadow logic processor (C1) has successfully saved its state, at which point the processor hardware flips the bit in the first field of the LSGS MSR 723 to indicate that the shadow processing entities are now in the special sleep state and ready to enter lock-step mode. In response to the bit flip, as observed in FIG. 7g, the respective software executing on both active processing entities C0, C2 requests to be active processing entities by writing to their respective LSME MSR register space 721. In response to the write to the LSME MSR, the processor starts to externally save the state of the active processing entities C0, C2.

As observed in FIG. 7h, both active processing entities C0, C2 have externally saved their state and entered a sleep state, which, in turn, causes the hardware to flip the bit in the second field 614 of the LSGS MSR (observed in FIG. 6c). The flipping of the bit in the second field 614 of the LSGS MSR indicates that all processing entities have formally entered lock-step mode (the lock-step group to which each of the processing entities belongs is active). Moreover, in an embodiment, the flipping of the bit causes the processor hardware to enable the lock-step comparators and broadcast logic between core peers.

In alternate embodiments, the set of processing entities do not simultaneously enter lock-step mode, e.g., as a matter of definition or otherwise. For instance, according to one alternate embodiment, a processing entity is deemed to be in lock-step mode when its state has been saved and it is placed in the sleep state. Nevertheless, lock-step execution is not allowed to begin until all processing entities are in lock-step mode.

The processor hardware, also in response to the flipping of the bit in the second field 614 of the LSGS MSR, as observed in FIG. 7i, first asserts reset for each of the C0, C1, C2 and C3 processing entities and then, coming out of reset, configures the same register state and instruction pointer configuration for C0 and C2, and the same register state and instruction pointer configuration for C1 and C3, so that C0 and C2 start from the same program location and C1 and C3 start from the same program location. The processors then begin execution of their respective corruption sensitive program code. Moreover, in an embodiment, unlike the traditional processor of FIG. 1 in which the BIOS reset was akin to a hard, platform reset, in the improved approach presently described, the resets of FIG. 7i are local/core resets which do not consume as much time as a platform reset.
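The entry handshake of FIGS. 7c-7i can be modeled with a small Python sketch. This is a software model under stated assumptions, not the hardware design: each LSGS field is treated as a single bit derived from which entities have saved their state, an LSME write and the completed state save are collapsed into one call, and rejecting an early request from an active entity is a modeling choice (the text only says that active entities wait for the first field to flip). The class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LockStepGroup:
    shadows: frozenset
    actives: frozenset
    # Entities whose state has been externally saved and that are asleep.
    saved: set = field(default_factory=set)

    @property
    def shadows_ready(self) -> int:
        """LSGS first field 613: 1 once every shadow entity has saved its state."""
        return int(self.shadows <= self.saved)

    @property
    def group_active(self) -> int:
        """LSGS second field 614: 1 once every entity in the group has saved its state."""
        return int((self.shadows | self.actives) <= self.saved)

    def request_entry(self, entity: str) -> None:
        """Model an LSME write followed by a completed state save."""
        if entity not in (self.shadows | self.actives):
            raise ValueError(f"{entity} is not in the lock-step group")
        if entity in self.actives and not self.shadows_ready:
            # Modeling choice: active entities must wait for field 613 to be 1.
            raise RuntimeError("shadow entities are not yet ready")
        self.saved.add(entity)
```

With shadows C1, C3 and actives C0, C2, `shadows_ready` stays 0 after only C3 has entered (FIG. 7e), becomes 1 once C1 has also entered (FIG. 7f), and `group_active` becomes 1 only after C0 and C2 follow (FIG. 7h).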

FIGS. 8a-8e continue with the above example and show execution and shutdown of lock-step mode followed by re-entry of the shadow processors back to their nominal operation. FIG. 8a shows both active processing entities and both shadow processing entities executing in lock-step. Referring to FIG. 8b, after execution of the corruption sensitive program code is complete (ideally, all processors execute the last instruction of the corruption sensitive program code during the same machine cycle), software on each processing entity writes to the first field of their respective LSME MSR register space 822 to request deactivation from lock-step mode.

In response to the requests being written into the LSME MSR space 822, as observed in FIG. 8c, the processor hardware operates to externally save the state of each processing entity. In FIG. 8d, the lock step (LS) state of each processor has been externally saved. Here, the processor hardware and/or other software can study the saved state and determine whether shadow processing entity execution was identical to active processing entity execution. If not, an error flag is raised. If so, the process continues to FIG. 8e.

As observed in FIG. 8e, after the lock-step state of each processing entity has been externally saved, the processor hardware resets each of the processing entities and, coming out of the reset, restores the state of the active processing entities and the shadow processing entities. In the case of the active processing entities C0, C2, the state that was saved after completion of the corruption sensitive program code is loaded back into the active processing entities C0, C2. By contrast, in the case of the shadow processing entities C1, C3, the state that was externally saved in response to the shadow processing entity's initial request to enter lock-step mode (FIG. 7d) is loaded back into the shadow processing entities. Also, again, the reset is a soft/local reset and not a hard platform reset.

As observed in FIG. 8e, after the correct respective state has been restored in the processing entities, the active and shadow processing entities formally exit lock step mode which resets the information in the LSGS MSR 823. The active processing entities C0, C2 continue execution of their thread(s) at the instruction(s) that follow the corruption sensitive instruction sequence. By contrast, after the respective initial state has been restored in the shadow processing entities C1, C3, the shadow processing entities C1, C3 continue execution of their thread(s) from the instruction(s) at the saved check point where execution was stopped to enter lock-step mode (FIG. 7c).
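The exit path of FIGS. 8c-8e can be summarized in a short Python sketch. It assumes saved state can be represented as a plain dictionary and that the comparison of FIG. 8d is a simple equality check; both are simplifications, and the function name is hypothetical.

```python
def exit_lock_step(active_ls_state: dict, shadow_ls_state: dict,
                   shadow_checkpoint: dict) -> tuple:
    """Return (state restored to the active entity, state restored to the shadow entity).

    active_ls_state / shadow_ls_state: state saved after lock-step execution (FIG. 8d).
    shadow_checkpoint: state saved when the shadow first requested entry (FIG. 7d).
    """
    if active_ls_state != shadow_ls_state:
        # Execution was not identical: raise the error flag instead of continuing.
        raise RuntimeError("lock-step mismatch between active and shadow entity")
    # The active entity resumes after the corruption sensitive sequence; the
    # shadow entity resumes from where it was parked before lock-step began.
    return active_ls_state, shadow_checkpoint
```

The asymmetry is the point of the sketch: the active entity keeps the result of the lock-step run, while the shadow entity discards it and returns to its earlier check point.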

As with lock-step mode entry, in alternate embodiments, the set of processing entities do not simultaneously exit lock-step mode, e.g., as a matter of definition or otherwise. For instance, according to one alternate embodiment, a processing entity is deemed to exit lock-step mode when its lock-step execution state has been saved.

Notably, unlike the traditional processor of FIG. 1 described in the Background that is not capable of stopping lock-step operation, the improved processor of FIG. 3a is designed to interrupt lock-step activity at any time during lock-step mode if certain events occur, and to report the interrupt through register space. Here, referring briefly back to FIG. 3a, the LSBS MSR register space 314 is designed to report that a break has occurred during lock-step mode/activity and provide additional information concerning the cause of the break.

FIG. 6d shows a more detailed embodiment of the LSBS MSR. As observed in FIG. 6d, the LSBS MSR includes three different fields for three different types of breaks. A first type of break occurs if any of the shadow processing entities in a lock-step group receives an interrupt signal or experiences a similar event (e.g., general interrupt, non maskable interrupt (NMI), system management interrupt (SMI), initialization (INIT), startup inter processor interrupt (SIPI), machine check, doorbells, etc.). Breaks of this type are reported in a first field 614 in the LSBS MSR. A second type of break is initiated by software and is reported in a second field 615. A third type of break occurs if a comparison made during lock-step indicates that two compared values are unequal and is reported in a third field 616. Again, any of these breaks will cause lock-step mode/activity to end.
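One way software might decode the three LSBS break fields is sketched below in Python. The field widths (one bit each) and bit positions are assumptions made for illustration; the text names the three fields but not their encoding. The enum and function names are hypothetical.

```python
from enum import IntFlag


class LSBS(IntFlag):
    # Assumed single-bit fields at assumed positions.
    SHADOW_EVENT = 1 << 0  # first field: interrupt-like event on a shadow entity
    SOFTWARE = 1 << 1      # second field: break initiated by software
    MISCOMPARE = 1 << 2    # third field: compared values were unequal


def break_causes(lsbs_value: int) -> list:
    """List the names of the break types reported in an LSBS value."""
    return [flag.name for flag in LSBS if lsbs_value & flag]
```

Because any of the three break types ends lock-step mode/activity, a non-empty result from `break_causes` would mean the lock-step group is no longer executing in lock-step.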

In the case of the third type of (miscompare) break event, in various embodiments, the miscompare is characterized according to one of three different characterizations: 1) uncorrectable errors, no action required (UENOA); 2) uncorrectable errors, software recovery required (UESRE); and, 3) uncorrectable errors (UC).

In the case of UENOA, the mis-compare error(s) did not cause changes to pertinent (e.g., control) architectural state. As such, lock-step mode can be restarted. In the case of UESRE, there is an error with a memory load/store or cache snoop transaction and the affected address is reported. In this case, lock-step mode can be continued if software cures the content of the affected address. In the case of UC, the mis-compare error(s) caused changes to pertinent (e.g., control) architectural state and lock-step mode cannot be re-started. In an embodiment, more than one UESRE error results in the UC state because the address of only one of the memory transaction errors is reported. In an embodiment, if the data emitted by the active and shadow processing entities do not match, the processor, in addition to triggering the lockstep break, can also mark the data as poisoned such that the destination of the data (other cores, devices, etc.) can be alerted that this data is suspect and should not be consumed.
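The three miscompare characterizations can be expressed as a small decision function. The Python sketch below is a simplification: it assumes the hardware can be summarized by two inputs (whether pertinent architectural state changed, and the list of addresses involved in memory transaction errors), which the text does not state. The function name and inputs are hypothetical.

```python
def classify_miscompare(changed_arch_state: bool, memory_error_addresses: list) -> str:
    """Classify a miscompare as "UENOA", "UESRE" or "UC"."""
    if changed_arch_state:
        # Pertinent architectural state was altered: lock-step cannot restart.
        return "UC"
    if len(memory_error_addresses) > 1:
        # Only one affected address can be reported, so software cannot
        # cure every location: escalate to UC.
        return "UC"
    if len(memory_error_addresses) == 1:
        # Software can cure the single reported address and continue.
        return "UESRE"
    # No architectural state change and no memory error: lock-step may restart.
    return "UENOA"
```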

Although the discussion(s) above have emphasized detection of corruptions from a mis-compare of values during lock-step, in some scenarios, there can be corruption(s) within a processing entity that do not lead to a mismatching error outside the processing entities and therefore go undetected by the lock-step scheme. In this case, the internal state of processing entity peers can be different.

To detect internal mismatches between peers, in various embodiments, the processor, as part of the lock-step break process also: a) flushes all processing entity internal caches, internal state and architectural state to on-die SRAM or other storage outside of the core; and, b) places the processing entities into a sleep state from where the hardware can reset and reconfigure out of lock-step mode. The respectively stored state of the peers can then be compared by the comparator as part of the lockstep break action. Any mismatches can be logged as errors and any mismatching data poisoned.

Note that the teachings above can still be performed with processors having variations of the specific, exemplary processor described above (e.g., some processors may not perform a double comparison of intermediate values).

Although embodiments above have stressed entry into lock-step mode for purposes of verifying execution of corruption sensitive program code, it is pertinent to mention that lock-step mode can be dynamically entered/exited for reasons other than such verification. For example, an error scouting application can periodically execute itself in lockstep mode to detect if any permanent faults have developed. The error scouting application itself does not have anything it cares to protect against corruption but uses the lockstep mode as a way to screen the hardware for defects.

The various processor operations described above can be realized/implemented with logic circuitry of the processor (e.g., one or more dedicated hardwired logic circuits (e.g., state machine logic circuit(s)), field programmable gate array (FPGA), etc.) designed to perform these operations along with any supporting state keeping elements (e.g., registers, embedded memory (SRAM, eDRAM), caches, external memory, etc.). As such, in particular, referring back to FIG. 3a, the improved processor also includes logic circuitry 320 to support dynamic entry/exit of a lock-step group's processing entities to/from lock-step mode including but not limited to: i) termination of lock-step execution by a lock step group's processing entities before lock-step execution is completed; ii) as part of the exit from lock-step mode, restoring a state of a shadow processing entity as the state existed before the shadow processing entity entered lock-step mode and began lock-step execution. Logic circuitry to perform any/all other processor operations described above can also be represented by logic circuitry 320 in FIG. 3a.

Note that any of the writes to MSR space described above can be implemented, in various embodiments, with a “write MSR” (WRMSR) instruction. Typically, execution of a WRMSR instruction entails the transfer of information from general purpose register space to MSR space. FIG. 3a depicts a high level view of an embodiment of a WRMSR instruction 330. As observed in FIG. 3a, the WRMSR instruction 330 includes an opcode field 331, a source field 332 and a destination field 333. Consistent with the above description, in various embodiments: 1) the opcode field 331, e.g., specifies a move of contents from source register space to a destination register space; 2) the source field 332 identifies content within general purpose register space (e.g., the entire content of a pair of general purpose registers EDX and EAX (“EDX:EAX”)); and, 3) the destination field 333 identifies other content within general purpose register space (e.g., the entire content of ECX) that identifies a specific lock-step MSR register 311, 312, 313, 314 or equivalent space within one or more MSRs. Upon execution of the WRMSR instruction, the content identified by the source field 332 is written into the MSR space identified by the content in the destination field 333. In other WRMSR embodiments, the instruction format of the WRMSR instruction 330 does not include explicit source and destination fields (fields 332 and 333 are not technically present). Rather, the source and destination are defined as part of the opcode definition (e.g., the opcode specified in field 331 is defined to read the source information from EDX:EAX and write it to MSR space identified in ECX).
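The data movement performed by WRMSR (EDX:EAX written to the MSR selected by ECX) can be illustrated with a toy Python model. This is a software simulation for explanation only: it does not execute the real instruction, and the MSR address `LSME_MSR_ADDR` is a made-up placeholder because the text does not give addresses for the lock-step MSRs.

```python
MASK32 = 0xFFFFFFFF
LSME_MSR_ADDR = 0x1234  # hypothetical address, not taken from the text


class MsrSpace:
    """Toy model of MSR space addressed by a 32-bit index."""

    def __init__(self):
        self._regs = {}

    def wrmsr(self, ecx: int, edx: int, eax: int) -> None:
        """Write EDX:EAX (EDX = high 32 bits, EAX = low 32 bits) to MSR[ECX]."""
        self._regs[ecx & MASK32] = ((edx & MASK32) << 32) | (eax & MASK32)

    def read(self, ecx: int) -> int:
        """Return the 64-bit content of MSR[ECX] (0 if never written)."""
        return self._regs.get(ecx & MASK32, 0)
```

In this model, a request such as the LSME write described earlier would be `msrs.wrmsr(LSME_MSR_ADDR, 0, value)`, with `value` carrying the requested field settings in the low bits of EAX.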

A method has been described above as depicted in FIG. 3b. As observed in FIG. 3b, the method includes recognizing imminent execution of corruption sensitive program code 301. The method further includes identifying active and shadow processing entities to execute the corruption sensitive program code in lock-step 302. The method also includes, before executing the corruption sensitive program code in lock-step, saving state information of a shadow processing entity of the active and shadow processing entities 303. The method also includes executing the corruption sensitive program code in lock-step with the active and shadow processing entities 304. The method also includes, after lock-step execution of the corruption sensitive program code by the active and shadow processing entities is finished, restoring the shadow processing entity with the state information 305. The method is executed above a BIOS level.
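Steps 303 through 305 of the method can be traced end to end with a minimal Python sketch. It assumes the corruption sensitive program code can be modeled as a pure function from a state dictionary to a result, and that both entities start from a copy of the active entity's state; steps 301 and 302 (recognition and selection) are taken as already done by the caller. The function name is hypothetical.

```python
import copy


def run_in_lock_step(code, active_state: dict, shadow_state: dict) -> tuple:
    """Return (lock-step result, state restored to the shadow entity, match flag)."""
    # 303: save the shadow entity's own state before lock-step begins.
    checkpoint = copy.deepcopy(shadow_state)
    # 304: both entities execute the same code from the same initial state.
    active_result = code(copy.deepcopy(active_state))
    shadow_result = code(copy.deepcopy(active_state))
    matched = active_result == shadow_result
    # 305: the shadow entity is restored from its checkpoint, not from the
    # lock-step result.
    return active_result, checkpoint, matched
```

A caller would treat `matched == False` as the miscompare case and handle it before trusting the result.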

Processing Components for Executing Instructions

FIG. 9 is a block diagram illustrating processing components for executing instructions, according to some embodiments. As illustrated, storage 901 stores instruction(s) 903 to be executed, including, e.g., instructions that when executed perform any/all of the MSR register write operations, and/or other operations discussed at length above, to effect software visible and/or software controlled lock-step group configuration, execution and/or implementation. As described further below, in some embodiments, computing system 900 is a SIMD processor to concurrently process multiple elements of packed-data vectors, including matrices.

In operation, an instruction 903 is fetched from storage 901 by fetch circuitry 905. The fetched instruction 907 is decoded by decode circuitry 909. The instruction format has fields (not shown here) to specify locations of first, second, and destination vectors. Decode circuitry 909 decodes the fetched instruction 907 into one or more operations. In some embodiments, this decoding includes generating a plurality of micro-operations to be performed by execution circuitry (such as execution circuitry 917). The decode circuitry 909 also decodes instruction suffixes and prefixes (if used).

In some embodiments, register renaming, register allocation, and/or scheduling circuit 913 provides functionality for one or more of: 1) renaming logical operand values to physical operand values (e.g., a register alias table in some embodiments), 2) allocating status bits and flags to the decoded instruction, and 3) scheduling the decoded instruction 911 for execution on execution circuitry 917 out of an instruction pool (e.g., using a reservation station in some embodiments).
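For readers unfamiliar with the renaming function mentioned above, the idea of a register alias table can be shown with a deliberately small Python sketch. It is a textbook-style simplification, not a description of circuit 913: physical registers are handed out in order and never reclaimed, and all names are hypothetical.

```python
class RegisterAliasTable:
    """Maps logical register names to physical register numbers."""

    def __init__(self, num_physical: int):
        self.free = list(range(num_physical))  # unallocated physical registers
        self.mapping = {}                      # logical name -> physical number

    def rename(self, dest: str, sources: list) -> tuple:
        """Rename one instruction; return (physical dest, physical sources)."""
        # Sources read the most recent mapping of each logical register.
        phys_sources = [self.mapping[s] for s in sources]
        # The destination always gets a fresh physical register, so a later
        # write to the same logical register does not disturb earlier readers.
        phys_dest = self.free.pop(0)
        self.mapping[dest] = phys_dest
        return phys_dest, phys_sources
```

Renaming a second write to the same logical register yields a different physical register, which is what removes false (write-after-write and write-after-read) dependencies.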

Registers (register file) and/or memory 915 store data as operands of the decoded instruction 911 to be operated on by execution circuitry 917. Exemplary register types, other than MSR registers, include writemask registers, packed data registers, general purpose registers, and floating-point registers. In some embodiments, write back circuit 919 commits the result of the execution of the decoded instruction 911.

Instruction Sets

An instruction set may include one or more instruction formats. A given instruction format may define various fields (e.g., number of bits, location of bits) to specify, among other things, the operation to be performed (e.g., opcode) and the operand(s) on which that operation is to be performed and/or other data field(s) (e.g., mask). Some instruction formats are further broken down through the definition of instruction templates (or subformats). For example, the instruction templates of a given instruction format may be defined to have different subsets of the instruction format's fields (the included fields are typically in the same order, but at least some have different bit positions because there are fewer fields included) and/or defined to have a given field interpreted differently. Thus, each instruction of an ISA is expressed using a given instruction format (and, if defined, in a given one of the instruction templates of that instruction format) and includes fields for specifying the operation and the operands. For example, an exemplary ADD instruction has a specific opcode and an instruction format that includes an opcode field to specify that opcode and operand fields to select operands (source1/destination and source2); and an occurrence of this ADD instruction in an instruction stream will have specific contents in the operand fields that select specific operands. A set of SIMD extensions referred to as the Advanced Vector Extensions (AVX) (AVX1 and AVX2) and using the Vector Extensions (VEX) coding scheme has been released and/or published (e.g., see Intel® 64 and IA-32 Architectures Software Developer's Manual, September 2014; and see Intel® Advanced Vector Extensions Programming Reference, October 2014).

Exemplary Core Architectures, Processors, and Computer Architectures

Processor cores may be implemented in different ways, for different purposes, and in different processors. For instance, implementations of such cores may include: 1) a general purpose in-order core intended for general-purpose computing; 2) a high performance general purpose out-of-order core intended for general-purpose computing; 3) a special purpose core intended primarily for graphics and/or scientific (throughput) computing. Implementations of different processors may include: 1) a CPU including one or more general purpose in-order cores intended for general-purpose computing and/or one or more general purpose out-of-order cores intended for general-purpose computing; and 2) a coprocessor including one or more special purpose cores intended primarily for graphics and/or scientific (throughput) computing. Such different processors lead to different computer system architectures, which may include: 1) the coprocessor on a separate chip from the CPU; 2) the coprocessor on a separate die in the same package as a CPU; 3) the coprocessor on the same die as a CPU (in which case, such a coprocessor is sometimes referred to as special purpose logic, such as integrated graphics and/or scientific (throughput) logic, or as special purpose cores); and 4) a system on a chip that may include on the same die the described CPU (sometimes referred to as the application core(s) or application processor(s)), the above described coprocessor, and additional functionality. Exemplary core architectures are described next, followed by descriptions of exemplary processors and computer architectures.

FIG. 10A is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline according to some embodiments of the invention. FIG. 10B is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming, out-of-order issue/execution architecture core to be included in a processor according to some embodiments of the invention. The solid lined boxes in FIGS. 10A-B illustrate the in-order pipeline and in-order core, while the optional addition of the dashed lined boxes illustrates the register renaming, out-of-order issue/execution pipeline and core. Given that the in-order aspect is a subset of the out-of-order aspect, the out-of-order aspect will be described.

In FIG. 10A, a processor pipeline 1000 includes a fetch stage 1002, a length decode stage 1004, a decode stage 1006, an allocation stage 1008, a renaming stage 1010, a scheduling (also known as a dispatch or issue) stage 1012, a register read/memory read stage 1014, an execute stage 1016, a write back/memory write stage 1018, an exception handling stage 1022, and a commit stage 1024.

FIG. 10B shows processor core 1090 including a front end unit 1030 coupled to an execution engine unit 1050, and both are coupled to a memory unit 1070. The core 1090 may be a reduced instruction set computing (RISC) core, a complex instruction set computing (CISC) core, a very long instruction word (VLIW) core, or a hybrid or alternative core type. As yet another option, the core 1090 may be a special-purpose core, such as, for example, a network or communication core, compression engine, coprocessor core, general purpose computing graphics processing unit (GPGPU) core, graphics core, or the like.

The front end unit 1030 includes a branch prediction unit 1032 coupled to an instruction cache unit 1034, which is coupled to an instruction translation lookaside buffer (TLB) 1036, which is coupled to an instruction fetch unit 1038, which is coupled to a decode unit 1040. The decode unit 1040 (or decoder) may decode instructions, and generate as an output one or more micro-operations, micro-code entry points, microinstructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decode unit 1040 may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read only memories (ROMs), etc. In one embodiment, the core 1090 includes a microcode ROM or other medium that stores microcode for certain macroinstructions (e.g., in decode unit 1040 or otherwise within the front end unit 1030). The decode unit 1040 is coupled to a rename/allocator unit 1052 in the execution engine unit 1050.

The execution engine unit 1050 includes the rename/allocator unit 1052 coupled to a retirement unit 1054 and a set of one or more scheduler unit(s) 1056. The scheduler unit(s) 1056 represents any number of different schedulers, including reservation stations, central instruction window, etc. The scheduler unit(s) 1056 is coupled to the physical register file(s) unit(s) 1058. Each of the physical register file(s) units 1058 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating-point, packed integer, packed floating-point, vector integer, vector floating-point, status (e.g., an instruction pointer that is the address of the next instruction to be executed), etc. In one embodiment, the physical register file(s) unit 1058 comprises a vector registers unit, a write mask registers unit, and a scalar registers unit. These register units may provide architectural vector registers, vector mask registers, and general purpose registers. The physical register file(s) unit(s) 1058 is overlapped by the retirement unit 1054 to illustrate various ways in which register renaming and out-of-order execution may be implemented (e.g., using a reorder buffer(s) and a retirement register file(s); using a future file(s), a history buffer(s), and a retirement register file(s); using register maps and a pool of registers; etc.). The retirement unit 1054 and the physical register file(s) unit(s) 1058 are coupled to the execution cluster(s) 1060. The execution cluster(s) 1060 includes a set of one or more execution units 1062 and a set of one or more memory access units 1064. The execution units 1062 may perform various operations (e.g., shifts, addition, subtraction, multiplication) on various types of data (e.g., scalar floating-point, packed integer, packed floating-point, vector integer, vector floating-point). While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions. The scheduler unit(s) 1056, physical register file(s) unit(s) 1058, and execution cluster(s) 1060 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating-point/packed integer/packed floating-point/vector integer/vector floating-point pipeline, and/or a memory access pipeline that each have their own scheduler unit, physical register file(s) unit, and/or execution cluster; in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of this pipeline has the memory access unit(s) 1064). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.

The set of memory access units 1064 is coupled to the memory unit 1070, which includes a data TLB unit 1072 coupled to a data cache unit 1074 coupled to a level 2 (L2) cache unit 1076. In one exemplary embodiment, the memory access units 1064 may include a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit 1072 in the memory unit 1070. The instruction cache unit 1034 is further coupled to a level 2 (L2) cache unit 1076 in the memory unit 1070. The L2 cache unit 1076 is coupled to one or more other levels of cache and eventually to a main memory.

By way of example, the exemplary register renaming, out-of-order issue/execution core architecture may implement the pipeline 1000 as follows: 1) the instruction fetch unit 1038 performs the fetch and length decoding stages 1002 and 1004; 2) the decode unit 1040 performs the decode stage 1006; 3) the rename/allocator unit 1052 performs the allocation stage 1008 and renaming stage 1010; 4) the scheduler unit(s) 1056 performs the schedule stage 1012; 5) the physical register file(s) unit(s) 1058 and the memory unit 1070 perform the register read/memory read stage 1014; 6) the execution cluster 1060 performs the execute stage 1016; 7) the memory unit 1070 and the physical register file(s) unit(s) 1058 perform the write back/memory write stage 1018; 8) various units may be involved in the exception handling stage 1022; and 9) the retirement unit 1054 and the physical register file(s) unit(s) 1058 perform the commit stage 1024.

The core 1090 may support one or more instruction sets (e.g., the x86 instruction set (with some extensions that have been added with newer versions); the MIPS instruction set of MIPS Technologies of Sunnyvale, Calif.; the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, Calif.), including the instruction(s) described herein. In one embodiment, the core 1090 includes logic to support a packed data instruction set extension (e.g., AVX1, AVX2), thereby allowing the operations used by many multimedia applications to be performed using packed data.

It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways including time sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that physical core is simultaneously multithreading), or a combination thereof (e.g., time sliced fetching and decoding and simultaneous multithreading thereafter such as in the Intel® Hyperthreading technology).

While register renaming is described in the context of out-of-order execution, it should be understood that register renaming may be used in an in-order architecture. While the illustrated embodiment of the processor also includes separate instruction and data cache units 1034/1074 and a shared L2 cache unit 1076, alternative embodiments may have a single internal cache for both instructions and data, such as, for example, a Level 1 (L1) internal cache, or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the cache may be external to the core and/or the processor.

Specific Exemplary In-Order Core Architecture

FIGS. 11A-B illustrate a block diagram of a more specific exemplary in-order core architecture, which core would be one of several logic blocks (including other cores of the same type and/or different types) in a chip. The logic blocks communicate through a high-bandwidth interconnect network (e.g., a ring network) with some fixed function logic, memory I/O interfaces, and other necessary I/O logic, depending on the application.

FIG. 11A is a block diagram of a single processor core, along with its connection to the on-die interconnect network 1102 and with its local subset of the Level 2 (L2) cache 1104, according to some embodiments of the invention. In one embodiment, an instruction decoder 1100 supports the x86 instruction set with a packed data instruction set extension. An L1 cache 1106 allows low-latency accesses to cache memory into the scalar and vector units. While in one embodiment (to simplify the design), a scalar unit 1108 and a vector unit 1110 use separate register sets (respectively, scalar registers 1112 and vector registers 1114) and data transferred between them is written to memory and then read back in from a level 1 (L1) cache 1106, alternative embodiments of the invention may use a different approach (e.g., use a single register set or include a communication path that allows data to be transferred between the two register files without being written and read back).

The local subset of the L2 cache 1104 is part of a global L2 cache that is divided into separate local subsets, one per processor core. Each processor core has a direct access path to its own local subset of the L2 cache 1104. Data read by a processor core is stored in its L2 cache subset 1104 and can be accessed quickly, in parallel with other processor cores accessing their own local L2 cache subsets. Data written by a processor core is stored in its own L2 cache subset 1104 and is flushed from other subsets, if necessary. The ring network ensures coherency for shared data. The ring network is bi-directional to allow agents such as processor cores, L2 caches and other logic blocks to communicate with each other within the chip. Each ring data-path is 1112-bits wide per direction.

FIG. 11B is an expanded view of part of the processor core in FIG. 11A according to some embodiments of the invention. FIG. 11B includes an L1 data cache 1106A, part of the L1 cache 1106, as well as more detail regarding the vector unit 1110 and the vector registers 1114. Specifically, the vector unit 1110 is a 16-wide vector processing unit (VPU) (see the 16-wide ALU 1128), which executes one or more of integer, single-precision float, and double-precision float instructions. The VPU supports swizzling the register inputs with swizzle unit 1120, numeric conversion with numeric convert units 1122A and 1122B, and replication with replication unit 1124 on the memory input. Write mask registers 1126 allow predicating resulting vector writes.

FIG. 12 is a block diagram of a processor 1200 that may have more than one core, may have an integrated memory controller, and may have integrated graphics according to some embodiments of the invention. The solid lined boxes in FIG. 12 illustrate a processor 1200 with a single core 1202A, a system agent 1210, a set of one or more bus controller units 1216, while the optional addition of the dashed lined boxes illustrates an alternative processor 1200 with multiple cores 1202A through 1202N, a set of one or more integrated memory controller unit(s) 1214 in the system agent unit 1210, and special purpose logic 1208.

Thus, different implementations of the processor 1200 may include: 1) a CPU with the special purpose logic 1208 being integrated graphics and/or scientific (throughput) logic (which may include one or more cores), and the cores 1202A-N being one or more general purpose cores (e.g., general purpose in-order cores, general purpose out-of-order cores, a combination of the two); 2) a coprocessor with the cores 1202A-N being a large number of special purpose cores intended primarily for graphics and/or scientific (throughput); and 3) a coprocessor with the cores 1202A-N being a large number of general purpose in-order cores. Thus, the processor 1200 may be a general-purpose processor, coprocessor or special-purpose processor, such as, for example, a network or communication processor, compression engine, graphics processor, GPGPU (general purpose graphics processing unit), a high-throughput many integrated core (MIC) coprocessor (including 30 or more cores), embedded processor, or the like. The processor may be implemented on one or more chips. The processor 1200 may be a part of and/or may be implemented on one or more substrates using any of a number of process technologies, such as, for example, CMOS, manufacturing technologies that use a gate dielectric other than silicon dioxide, FinFET manufacturing technologies, etc.

The memory hierarchy includes one or more levels of cache within the cores, a set of one or more shared cache units 1206, and external memory (not shown) coupled to the set of integrated memory controller units 1214. The set of shared cache units 1206 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While in one embodiment a ring based interconnect unit 1212 interconnects the integrated graphics logic 1208 (integrated graphics logic 1208 is an example of and is also referred to herein as special purpose logic), the set of shared cache units 1206, and the system agent unit 1210/integrated memory controller unit(s) 1214, alternative embodiments may use any number of well-known techniques for interconnecting such units. In one embodiment, coherency is maintained between one or more cache units 1206 and cores 1202A-N.

In some embodiments, one or more of the cores 1202A-N are capable of multi-threading. The system agent 1210 includes those components coordinating and operating cores 1202A-N. The system agent unit 1210 may include for example a power control unit (PCU) and a display unit. The PCU may be or include logic and components needed for regulating the power state of the cores 1202A-N and the integrated graphics logic 1208. The display unit is for driving one or more externally connected displays.

The cores 1202A-N may be homogenous or heterogeneous in terms of architecture instruction set; that is, two or more of the cores 1202A-N may be capable of executing the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set.

Exemplary Computer Architectures

FIGS. 13-16 are block diagrams of exemplary computer architectures. Other system designs and configurations known in the arts for laptops, desktops, handheld PCs, personal digital assistants, engineering workstations, servers, network devices, network hubs, switches, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, micro controllers, cell phones, portable media players, hand held devices, and various other electronic devices, are also suitable. In general, a huge variety of systems or electronic devices capable of incorporating a processor and/or other execution logic as disclosed herein are generally suitable.

Referring now to FIG. 13, shown is a block diagram of a system 1300 in accordance with one embodiment of the present invention. The system 1300 may include one or more processors 1310, 1315, which are coupled to a controller hub 1320. In one embodiment the controller hub 1320 includes a graphics memory controller hub (GMCH) 1390 and an Input/Output Hub (IOH) 1350 (which may be on separate chips); the GMCH 1390 includes memory and graphics controllers to which are coupled memory 1340 and a coprocessor 1345; the IOH 1350 couples input/output (I/O) devices 1360 to the GMCH 1390. Alternatively, one or both of the memory and graphics controllers are integrated within the processor (as described herein), the memory 1340 and the coprocessor 1345 are coupled directly to the processor 1310, and the controller hub 1320 is in a single chip with the IOH 1350.

The optional nature of additional processors 1315 is denoted in FIG. 13 with broken lines. Each processor 1310, 1315 may include one or more of the processing cores described herein and may be some version of the processor 1200.

The memory 1340 may be, for example, dynamic random access memory (DRAM), byte addressable non-volatile memory, or a combination of the two. For at least one embodiment, the controller hub 1320 communicates with the processor(s) 1310, 1315 via a multi-drop bus, such as a frontside bus (FSB), point-to-point interface such as QuickPath Interconnect (QPI), or similar connection 1395.

In one embodiment, the coprocessor 1345 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like. In one embodiment, controller hub 1320 may include an integrated graphics accelerator.

There can be a variety of differences between the physical resources 1310, 1315 in terms of a spectrum of metrics of merit including architectural, microarchitectural, thermal, power consumption characteristics, and the like.

In one embodiment, the processor 1310 executes instructions that control data processing operations of a general type. Embedded within the instructions may be coprocessor instructions. The processor 1310 recognizes these coprocessor instructions as being of a type that should be executed by the attached coprocessor 1345. Accordingly, the processor 1310 issues these coprocessor instructions (or control signals representing coprocessor instructions) on a coprocessor bus or other interconnect, to coprocessor 1345. Coprocessor(s) 1345 accept and execute the received coprocessor instructions.

Referring now to FIG. 14, shown is a block diagram of a first more specific exemplary system 1400 in accordance with an embodiment of the present invention. As shown in FIG. 14, multiprocessor system 1400 is a point-to-point interconnect system, and includes a first processor 1470 and a second processor 1480 coupled via a point-to-point interconnect 1450. Each of processors 1470 and 1480 may be some version of the processor 1200. In some embodiments, processors 1470 and 1480 are respectively processors 1310 and 1315, while coprocessor 1438 is coprocessor 1345. In another embodiment, processors 1470 and 1480 are respectively processor 1310 and coprocessor 1345.

Processors 1470 and 1480 are shown including integrated memory controller (IMC) units 1472 and 1482, respectively. Processor 1470 also includes as part of its bus controller units point-to-point (P-P) interfaces 1476 and 1478; similarly, second processor 1480 includes P-P interface circuits 1486 and 1488. Processors 1470, 1480 may exchange information via a point-to-point (P-P) interface 1450 using P-P interface circuits 1478, 1488. As shown in FIG. 14, IMCs 1472 and 1482 couple the processors to respective memories, namely a memory 1432 and a memory 1434, which may be portions of main memory locally attached to the respective processors.

Processors 1470, 1480 may each exchange information with a chipset 1490 via individual P-P interfaces 1452, 1454 using point to point interface circuits 1476, 1494, 1486, 1498. Chipset 1490 may optionally exchange information with the coprocessor 1438 via a high-performance interface 1492. In one embodiment, the coprocessor 1438 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like.

A shared cache (not shown) may be included in either processor or outside of both processors yet connected with the processors via P-P interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode.

Chipset 1490 may be coupled to a first bus 1416 via an interface 1496. In one embodiment, first bus 1416 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the present invention is not so limited.

As shown in FIG. 14, various I/O devices 1414 may be coupled to first bus 1416, along with a bus bridge 1418 which couples first bus 1416 to a second bus 1420. In one embodiment, one or more additional processor(s) 1415, such as coprocessors, high-throughput MIC processors, GPGPUs, accelerators (such as, e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processor, are coupled to first bus 1416. In one embodiment, second bus 1420 may be a low pin count (LPC) bus. Various devices may be coupled to a second bus 1420 including, for example, a keyboard and/or mouse 1422, communication devices 1427 and a storage unit 1428 such as a disk drive or other mass storage device which may include instructions/code and data 1430, in one embodiment. Further, an audio I/O 1424 may be coupled to the second bus 1420. Note that other architectures are possible. For example, instead of the point-to-point architecture of FIG. 14, a system may implement a multi-drop bus or other such architecture.

Referring now to FIG. 15, shown is a block diagram of a second more specific exemplary system 1500 in accordance with an embodiment of the present invention. Like elements in FIGS. 14 and 15 bear like reference numerals, and certain aspects of FIG. 14 have been omitted from FIG. 15 in order to avoid obscuring other aspects of FIG. 15.

FIG. 15 illustrates that the processors 1470, 1480 may include integrated memory and I/O control logic (“CL”) 1572 and 1582, respectively. Thus, the CL 1572, 1582 include integrated memory controller units and include I/O control logic. FIG. 15 illustrates that not only are the memories 1432, 1434 coupled to the CL 1572, 1582, but also that I/O devices 1514 are coupled to the control logic 1572, 1582. Legacy I/O devices 1515 are coupled to the chipset 1490.

Referring now to FIG. 16, shown is a block diagram of a SoC 1600 in accordance with an embodiment of the present invention. Similar elements in FIG. 12 bear like reference numerals. Also, dashed lined boxes are optional features on more advanced SoCs. In FIG. 16, an interconnect unit(s) 1602 is coupled to: an application processor 1610 which includes a set of one or more cores 1602A-N, which include cache units 1604A through 1604N, and shared cache unit(s) 1606; a system agent unit 1610; a bus controller unit(s) 1616; an integrated memory controller unit(s) 1614; a set of one or more coprocessors 1620 which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a static random access memory (SRAM) unit 1630; a direct memory access (DMA) unit 1632; and a display unit 1640 for coupling to one or more external displays. In one embodiment, the coprocessor(s) 1620 include a special-purpose processor, such as, for example, a network or communication processor, compression engine, GPGPU, a high-throughput MIC processor, embedded processor, or the like.

Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the invention may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.

Program code, such as code 1430 illustrated in FIG. 14, may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices, in known fashion. For purposes of this application, a processing system includes any system that has a processor, such as, for example, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or a microprocessor.

The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.

One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium which represents various logic within the processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as “IP cores,” may be stored on a tangible, machine readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.

Such machine-readable storage media may include, without limitation, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as hard disks, any other type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), phase change memory (PCM), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.

Accordingly, embodiments of the invention also include non-transitory, tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL), which defines structures, circuits, apparatuses, processors and/or system features described herein. Such embodiments may also be referred to as program products.

Emulation (Including Binary Translation, Code Morphing, Etc.)

In some cases, an instruction converter may be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction converter may translate (e.g., using static binary translation, dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction to one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on processor, off processor, or part on and part off processor.

FIG. 17 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set according to some embodiments of the invention. In the illustrated embodiment, the instruction converter is a software instruction converter, although alternatively the instruction converter may be implemented in software, firmware, hardware, or various combinations thereof. FIG. 17 shows that a program in a high level language 1702 may be compiled using an x86 compiler 1704 to generate x86 binary code 1706 that may be natively executed by a processor with at least one x86 instruction set core 1716. The processor with at least one x86 instruction set core 1716 represents any processor that can perform substantially the same functions as an Intel processor with at least one x86 instruction set core by compatibly executing or otherwise processing (1) a substantial portion of the instruction set of the Intel x86 instruction set core or (2) object code versions of applications or other software targeted to run on an Intel processor with at least one x86 instruction set core, in order to achieve substantially the same result as an Intel processor with at least one x86 instruction set core. The x86 compiler 1704 represents a compiler that is operable to generate x86 binary code 1706 (e.g., object code) that can, with or without additional linkage processing, be executed on the processor with at least one x86 instruction set core 1716. Similarly, FIG. 17 shows that the program in the high level language 1702 may be compiled using an alternative instruction set compiler 1708 to generate alternative instruction set binary code 1710 that may be natively executed by a processor without at least one x86 instruction set core 1714 (e.g., a processor with cores that execute the MIPS instruction set of MIPS Technologies of Sunnyvale, Calif. and/or that execute the ARM instruction set of ARM Holdings of Sunnyvale, Calif.). The instruction converter 1712 is used to convert the x86 binary code 1706 into code that may be natively executed by the processor without an x86 instruction set core 1714. This converted code is not likely to be the same as the alternative instruction set binary code 1710 because an instruction converter capable of this is difficult to make; however, the converted code will accomplish the general operation and be made up of instructions from the alternative instruction set. Thus, the instruction converter 1712 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation or any other process, allows a processor or other electronic device that does not have an x86 instruction set processor or core to execute the x86 binary code 1706.
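By way of illustration only, the conversion of one source instruction into one or more target instructions can be sketched as a small table-driven translator. The mnemonics below (`INC`, `MOV`, `PUSH`, `ADDI`, `STORE`) belong to two made-up toy instruction sets, and the function name `convert` is hypothetical; none of this is x86, MIPS, or ARM encoding, and it is not the converter 1712 of FIG. 17.

```python
def convert(source_program):
    """Statically translate toy 'source ISA' lines into toy 'target ISA' lines.

    Every mnemonic here is invented for the example. One source instruction
    may expand into one or more target instructions.
    """
    target = []
    for line in source_program:
        op, *args = line.replace(",", " ").split()
        if op == "INC":        # INC r      -> ADDI r, r, 1
            target.append(f"ADDI {args[0]}, {args[0]}, 1")
        elif op == "MOV":      # MOV d, s   -> ADDI d, s, 0
            target.append(f"ADDI {args[0]}, {args[1]}, 0")
        elif op == "PUSH":     # PUSH r     -> two target instructions
            target.append("ADDI sp, sp, -8")
            target.append(f"STORE {args[0]}, 0(sp)")
        else:
            raise ValueError(f"no translation for {op!r}")
    return target
```

The converted sequence need not match what a native compiler for the target instruction set would emit; it only has to accomplish the same general operation using target instructions.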

EXAMPLES

An apparatus has been described. The apparatus includes first model specific register (MSR) space to specify a granularity of a processing entity of a lock-step group of processing entities. The apparatus includes second MSR space to specify whether the processing entity is an active or shadow processing entity of the lock-step group of processing entities. The apparatus includes third MSR space to indicate that the lock-step group of processing entities is active. The first MSR space, the second MSR space and the third MSR space are accessible to at least one of a virtual machine monitor, an operating system and an application software program.
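As a rough illustration of how software above the BIOS level might model these three MSR spaces, the following Python sketch packs the granularity, the active/shadow role, and the group-active indication into one integer. The bit positions, the numeric encodings, and the names `Granularity`, `Role`, `encode_lockstep_msr` and `decode_lockstep_msr` are assumptions made for the example; the description does not define a bit layout or register addresses.

```python
from enum import IntEnum

class Granularity(IntEnum):
    # Hypothetical encodings; the description only names the granularities.
    PIPELINE = 0
    CORE = 1
    MODULE = 2
    TILE = 3
    DIE = 4

class Role(IntEnum):
    ACTIVE = 0
    SHADOW = 1

# Assumed layout: bits 2:0 granularity, bit 3 role, bit 4 group active.
GRAN_MASK = 0x7
ROLE_SHIFT = 3
ACTIVE_SHIFT = 4

def encode_lockstep_msr(granularity, role, group_active):
    """Pack the three fields into a single modeled register value."""
    return ((int(granularity) & GRAN_MASK)
            | (int(role) << ROLE_SHIFT)
            | (int(bool(group_active)) << ACTIVE_SHIFT))

def decode_lockstep_msr(value):
    """Unpack a modeled register value into (granularity, role, active)."""
    return (Granularity(value & GRAN_MASK),
            Role((value >> ROLE_SHIFT) & 1),
            bool((value >> ACTIVE_SHIFT) & 1))
```

For instance, `encode_lockstep_msr(Granularity.MODULE, Role.SHADOW, True)` models a shadow entity at module granularity whose lock-step group is active.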

In various embodiments, the granularity is any of: instruction execution pipeline granularity; module granularity; tile granularity; and die granularity. In various further embodiments the granularity also comprises core granularity.

In various embodiments the apparatus includes fourth MSR space to indicate when state information of a shadow processing entity of the lock-step group of processing entities has been successfully saved.

In various embodiments, the first MSR space, the second MSR space and the third MSR space are assigned a class that permits the first MSR space, the second MSR space and the third MSR space to be accessed by at least one of a virtual machine monitor and an operating system. In various further embodiments the class permits the first MSR space, the second MSR space and the third MSR space to be accessed by BIOS.

In various embodiments the apparatus includes fourth MSR space to provide information that describes an event that caused a termination, prior to completion, of lock-step execution by the lock-step group of processing entities. In various further embodiments the information is able to describe any of the following: a) mis-compare during the lock-step execution; b) an interrupt has been received by a shadow processing entity of the lock-step group of processing entities; c) a software initiated interrupt has occurred. In various further embodiments the apparatus includes fifth MSR space that, if a mis-compare during the lock-step execution caused the termination, provides even further information indicating any of: a) the lock-step execution can be restarted without software curing corrupted processing entity architectural state; b) the lock-step execution cannot be restarted without software curing corrupted processing entity architectural state; c) the lock-step execution cannot be restarted. In various further embodiments the apparatus is to mark data processed by the lock-step group of processing entities as being poisoned.
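The relationship between the termination-cause information and the restart information can be sketched as a small decoder. The numeric codes and the names `ExitCause`, `Restart` and `describe_exit` are hypothetical; the description lists the events and restart outcomes but assigns no encodings. The sketch consults the restart information only when the cause is a mis-compare, mirroring the condition stated above.

```python
from enum import Enum

class ExitCause(Enum):
    MISCOMPARE = "mis-compare during lock-step execution"
    SHADOW_INTERRUPT = "interrupt received by a shadow processing entity"
    SOFTWARE_INTERRUPT = "software initiated interrupt"

class Restart(Enum):
    RESTARTABLE = "restart without software curing state"
    NEEDS_SOFTWARE_CURE = "restart only after software cures state"
    NOT_RESTARTABLE = "cannot be restarted"

# Hypothetical numeric codes; the description does not assign any.
CAUSE_CODES = {
    1: ExitCause.MISCOMPARE,
    2: ExitCause.SHADOW_INTERRUPT,
    3: ExitCause.SOFTWARE_INTERRUPT,
}
RESTART_CODES = {
    0: Restart.RESTARTABLE,
    1: Restart.NEEDS_SOFTWARE_CURE,
    2: Restart.NOT_RESTARTABLE,
}

def describe_exit(cause_value, restart_value):
    """Return (cause, restart); restart is None unless a mis-compare occurred."""
    if cause_value not in CAUSE_CODES:
        raise ValueError(f"unknown cause code {cause_value}")
    cause = CAUSE_CODES[cause_value]
    if cause is not ExitCause.MISCOMPARE:
        return cause, None
    if restart_value not in RESTART_CODES:
        raise ValueError(f"unknown restart code {restart_value}")
    return cause, RESTART_CODES[restart_value]
```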

In various embodiments the apparatus further includes logic circuitry to, commensurate with an exit from a lock-step mode: a) save and compare internal cache and state information of lock-step peers; b) raise an error if the compare results in a mis-compare.
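A software model of this save-and-compare step might look like the following minimal sketch. It treats each peer's cache and state information as a dictionary; the names `diff_state`, `exit_lockstep` and `LockStepMiscompare` are assumptions for the example, and the actual comparison is performed by logic circuitry rather than by code of this kind.

```python
_MISSING = object()  # sentinel for a state element present in only one peer

class LockStepMiscompare(Exception):
    """Raised when the saved state of the lock-step peers differs."""

def diff_state(active_state, shadow_state):
    """Return the sorted names of state elements that differ between peers."""
    names = sorted(set(active_state) | set(shadow_state))
    return [n for n in names
            if active_state.get(n, _MISSING) != shadow_state.get(n, _MISSING)]

def exit_lockstep(active_state, shadow_state):
    """Save both peers' state, compare the copies, raise on any mismatch."""
    saved_active = dict(active_state)   # a) save ...
    saved_shadow = dict(shadow_state)
    mismatches = diff_state(saved_active, saved_shadow)  # ... and compare
    if mismatches:
        raise LockStepMiscompare(mismatches)             # b) raise an error
    return saved_active
```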

In various embodiments the apparatus further comprises logic circuitry to, as part of an exit from a lock-step mode, restore a state of a shadow processing entity of the lock-step group of processing entities as the state existed before the shadow processing entity entered a lock-step mode and began lock-step execution.

A computing system has been described. The computing system includes a processor having: (i) first model specific register (MSR) space to specify a granularity of a processing entity of a lock-step group of processing entities; (ii) second MSR space to specify whether the processing entity is an active or shadow processing entity of the lock-step group of processing entities; (iii) third MSR space to indicate that the lock-step group is active. The first MSR space, the second MSR space and the third MSR space are accessible to at least one of a virtual machine monitor, an operating system and an application software program. The computing system also includes a main memory coupled to the processor and a network interface.

The computing system can also include any of the various embodiments and further embodiments described just above.

In various embodiments, the processor of the computing system is to execute a write to MSR register instruction that writes to the second MSR space to specify whether the processing entity is an active or shadow processing entity of the lock-step group of processing entities.

A method has been described. The method includes executing software at a level above a BIOS level, where the executing of the software includes: recognizing imminent execution of corruption sensitive program code; identifying active and shadow processing entities to execute the corruption sensitive program code in lock-step; before executing the corruption sensitive program code in lock-step, saving state information of a shadow processing entity of the active and shadow processing entities; executing the corruption sensitive program code in lock-step with the active and shadow processing entities; and, after lock-step execution of the corruption sensitive program code by the active and shadow processing entities is finished, restoring the shadow processing entity with the state information.
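The ordering of the method steps (save the shadow's state, execute in lock-step, restore the shadow) can be sketched as a simple simulation. The class `Entity`, the function `run_in_lockstep`, and the representation of program code as a list of callables that mutate a state dictionary are all assumptions made for the example; real lock-step execution is carried out by hardware, not by a loop of this kind.

```python
import copy

class Entity:
    """Toy processing entity holding its architectural state as a dict."""
    def __init__(self, name, state):
        self.name = name
        self.state = dict(state)

def run_in_lockstep(active, shadow, code):
    """Run `code` on both entities, then restore the shadow's prior state.

    `code` is a list of callables, each mutating a state dict in place.
    """
    saved_shadow = copy.deepcopy(shadow.state)      # save before entry
    try:
        shadow.state = copy.deepcopy(active.state)  # same initial state
        for step in code:
            step(active.state)
            step(shadow.state)
            if active.state != shadow.state:        # compare as we go
                raise RuntimeError("mis-compare during lock-step execution")
        return copy.deepcopy(active.state)
    finally:
        shadow.state = saved_shadow                 # restore on any exit
```

Placing the restore in a `finally` clause reflects that the shadow's earlier state is brought back whether the lock-step run finishes normally or terminates early.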

A processor has been described. The processor includes model specific register space that is visible to software above a BIOS level, the model specific register space to specify a granularity of a processing entity of a lock-step group. The processor includes logic circuitry to support dynamic entry/exit of the lock-step group's processing entities to/from lock-step mode including: i) termination of lock-step execution by the processing entities before the program code to be executed in lock-step is fully executed; and, ii) as part of the exit from the lock-step mode, restoration of a state of a shadow processing entity of the processing entities as the state existed before the shadow processing entity entered the lock-step mode and began lock-step execution of the program code.

What is claimed is:
1. An apparatus, comprising: first model specific register (MSR) space to specify a granularity of a processing entity of a lock-step group of processing entities; second MSR space to specify whether the processing entity is an active or shadow processing entity of the lock-step group of processing entities; third MSR space to indicate that the lock-step group of processing entities is active; wherein, the first MSR space, the second MSR space and the third MSR space is accessible to at least one of a virtual machine monitor, an operating system and an application software program.
2. The apparatus of claim 1 wherein the granularity is any of: instruction execution pipeline granularity; module granularity; tile granularity; die granularity.
3. The apparatus of claim 2 wherein the granularity also comprises core granularity.
4. The apparatus of claim 1 further comprising fourth MSR space to indicate when state information of a shadow processing entity of the lock step group of processing entities has been successfully saved.
5. The apparatus of claim 1 wherein the first MSR space, the second MSR space and the third MSR space is assigned a class that permits the first MSR space, the second MSR space and the third MSR space to be accessed by at least one of a virtual machine monitor and an operating system.
6. The apparatus of claim 5 wherein the class permits the first MSR space, the second MSR space and the third MSR space to be accessed by BIOS.
7. The apparatus of claim 1 further comprising fourth MSR space to provide information that describes an event that caused a termination, prior to completion, of lock-step execution by the lock step group of processing entities.
8. The apparatus of claim 7 wherein the information is able to describe any of the following: a) mis-compare during the lock-step execution; b) an interrupt has been received by a shadow processing entity of the lock-step group of processing entities; c) a software initiated interrupt has occurred.
9. The apparatus of claim 8 further comprising fifth MSR space that, if a mis-compare during the lock-step execution caused the termination, provides even further information indicating any of: a) the lock-step execution can be restarted without software curing corrupted processing entity architectural state; b) the lock-step execution cannot be restarted without software curing corrupted processing entity architectural state; c) the lock-step execution cannot be restarted.
10. The apparatus of claim 9 wherein the apparatus is to mark data processed by the lock-step group of processing entities as being poisoned.
11. The apparatus of claim 1 wherein the apparatus further comprises logic circuitry to, commensurate with an exit from a lock-step mode: a) save and compare internal cache and state information of lock-step peers; b) raise an error if the compare results in a mis-compare.
12. The apparatus of claim 1 wherein the apparatus further comprises logic circuitry to, as part of an exit from a lock-step mode, restore a state of a shadow processing entity of the lock-step group of processing entities as the state existed before the shadow processing entity entered a lock-step mode and began lock-step execution.
 13. Acomputing system, comprising: a) a processor, the processor comprising:(i) first model specific register (MSR) space to specify a granularityof a processing entity of a lock-step group of processing entities; (ii)second MSR space to specify whether the processing entity is an activeor shadow processing entity of the lock-step group of processingentities; (iii) third MSR space to indicate that the lock-step group isactive; wherein, the first MSR space, the second MSR space and the thirdMSR space is accessible to at least one of a virtual machine monitor, anoperating system and an application software program; b) a main memorycoupled to the processor; and, c) a network interface.
 14. The computingsystem of claim 13 wherein the granularity is any of: instructionexecution pipeline granularity; module granularity; tile granularity;die granularity.
 15. The computing system of claim 13 wherein theprocessor is to execute a write to MSR register instruction that writesto the second MSR space to specify whether the processing entity is anactive or shadow processing entity of the lock-step group of processingentities.
 16. The computing system of claim 13 further comprising fourthMSR space to indicate when state information of a shadow processingentity of the lock step group of processing entities has beensuccessfully saved.
 17. The computing system of claim 13 wherein the first MSR space, the second MSR space and the third MSR space is assigned a class that permits the first MSR space, the second MSR space and the third MSR space to be accessed by at least one of a virtual machine monitor and an operating system.
 18. The computing system of claim 17 wherein the class permits the first MSR space, the second MSR space and the third MSR space to be accessed by BIOS.
 19. The computing system of claim 13 further comprising fourth MSR space to provide information that describes an event that caused a termination, prior to completion, of lock-step execution by the lock step group of processing entities.
 20. The computing system of claim 19 wherein the information is able to describe any of the following: a) mis-compare during the lock-step execution; b) an interrupt has been received by a shadow processing entity of the lock-step group of processing entities; c) a software initiated interrupt has occurred.
 21. The computing system of claim 20 further comprising fifth MSR space that, if a mis-compare during the lock-step execution caused the termination, provides even further information indicating any of: a) the lock-step execution can be restarted without software curing corrupted processing entity architectural state; b) the lock-step execution cannot be restarted without software curing corrupted processing entity architectural state; c) the lock-step execution cannot be restarted.
 22. The computing system of claim 21 wherein the processor is to mark data processed by the lock-step of processing entities as being poisoned.
 23. The computing system of claim 1 wherein the processor further comprises logic circuitry to, commensurate with an exit from a lock-step mode: a) save and compare internal cache and state information of lock-step peers; b) raise an error if the compare results in a mis-compare.
 24. The apparatus of claim 13 wherein the processor further comprises logic circuitry to, as part of an exit from a lock-step mode, restore a state of a shadow processing entity of the lock-step group of processing entities as the state existed before the shadow processing entity entered a lock-step mode and began lock-step execution.
 25. A method, comprising: executing software at a level above a BIOS level, the executing of the software comprising: recognizing imminent execution of corruption sensitive program code; identifying active and shadow processing entities to execute the corruption sensitive program code in lock-step; before executing the corruption sensitive program code in lock-step, saving state information of a shadow processing entity of the active and shadow processing entities; executing the corruption sensitive program code in lock-step with the active and shadow processing entities; and, after lock-step execution of the corruption sensitive program code by the active and shadow processing entities is finished, restoring the shadow processing entity with the state information.
 26. A processor comprising: a) model specific register space that is visible to software above a BIOS level, the model specific register space to specify a granularity of a processing entity of a lock-step group; and, b) logic circuitry to support dynamic entry/exit of the lock-step group's processing entities to/from lock-step mode including i) and ii) below: i) termination of lock-step execution by the processing entities before the program code to be executed in lock-step is fully executed; and, ii) as part of the exit from the lock-step mode, restoration of a state of a shadow processing entity of the processing entities as the state existed before the shadow processing entity entered the lock-step mode and began lock-step execution of the program code.
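For illustration only, the software flow recited in claim 25, together with the MSR spaces of claims 13, 16 and 19-20, might be modeled in a few lines of Python. This is a minimal sketch under stated assumptions, not the claimed hardware or any real MSR interface: every name in it (`Entity`, `LockStepMsrs`, `run_in_lockstep`, `inject_fault`) is hypothetical, architectural state is reduced to a dictionary, and lock-step comparison is reduced to comparing the two dictionaries after each step.

```python
import copy
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional


class Granularity(Enum):          # granularities listed in claim 14
    PIPELINE = "instruction execution pipeline"
    MODULE = "module"
    TILE = "tile"
    DIE = "die"


class Role(Enum):                 # active/shadow role of claim 13(ii)
    ACTIVE = "active"
    SHADOW = "shadow"


class ExitReason(Enum):           # termination causes listed in claim 20
    NONE = "ran to completion"
    MISCOMPARE = "mis-compare"
    SHADOW_INTERRUPT = "interrupt received by shadow"
    SOFTWARE_INTERRUPT = "software initiated interrupt"


@dataclass
class Entity:
    name: str
    role: Role
    state: dict = field(default_factory=dict)   # stand-in for architectural state


@dataclass
class LockStepMsrs:               # hypothetical software view of the MSR spaces
    granularity: Granularity
    group_active: bool = False
    shadow_state_saved: bool = False
    exit_reason: ExitReason = ExitReason.NONE


def run_in_lockstep(
    msrs: LockStepMsrs,
    active: Entity,
    shadow: Entity,
    code: list,
    inject_fault: Optional[Callable[[int, dict], None]] = None,
) -> None:
    """Save shadow state, run `code` on both entities, restore shadow state."""
    saved = copy.deepcopy(shadow.state)          # save before lock-step entry
    msrs.shadow_state_saved = True
    shadow.state = copy.deepcopy(active.state)   # same initial state on both
    msrs.group_active = True
    msrs.exit_reason = ExitReason.NONE
    try:
        for i, step in enumerate(code):
            step(active.state)
            step(shadow.state)
            if inject_fault is not None:         # test hook: simulated corruption
                inject_fault(i, shadow.state)
            if active.state != shadow.state:     # early exit on mis-compare
                msrs.exit_reason = ExitReason.MISCOMPARE
                break
    finally:
        shadow.state = saved                     # restore pre-entry shadow state
        msrs.group_active = False
```

The `try`/`finally` is a design choice of this sketch: it ensures the shadow entity gets its pre-entry state back whether the code runs to completion or lock-step execution terminates early. Interrupt-driven exits are represented only as `ExitReason` values and are not simulated.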